-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update probe-interval and stale contact point timeout calculation #2601
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question - if we have a bad probe value between intervals, i.e.
- 0:05 - hit node, it's good
- 0:10 - hit node, it's good
- 0:15 - no response from node
- 0:16 - join decider runs
Will that bad contact in step 3 invalidate the good contacts from earlier? It should, otherwise this will damage the algorithm. What happens there?
} | ||
else | ||
{ | ||
_staleContactPointTimeout = new TimeSpan((cps.ProbeInterval.Ticks + cps.ProbingFailureTimeout.Ticks) * 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
First of all, ContactPoint HTTP probing is done in parallel, so the first and second would happen pretty close to each other, then the failure happened. Second, Depends on the
|
Join decider ticks, discovery resolution ticks, and HTTP probing ticks happens in their own timers.
Discovery resolution tick timer interval is set to |
Question I'm really asking is that if a node somehow dies shortly after being successfully pinged, before a cluster is formed, is it possible to form a cluster under the |
I guess this is where our and scala implementation diverge with this update. Here, we're lenghtening that value so there is actually a possibility that there will be a race condition where a contact point is actually dead but its not being pruned out of the contact point list, if that makes any sense. |
So we should redesign this feature then - no need for a separate stale value, just always slide the timeout with the interval at all times (i.e it's never a hard timeout, it's always a |
|
ok, done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
{ | ||
_staleContactPointTimeout = new TimeSpan((cps.ProbeInterval.Ticks + cps.ProbingFailureTimeout.Ticks) * 2); | ||
} | ||
_staleContactPointTimeout = cps.ProbeInterval + cps.ProbingFailureTimeout; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixes #2571
Changes
stale-contact-point-timeout
HOCON settingBootstrapCoordinator
to( ProbeInterval + ProbeFailureTimeout )
, it wasProbeFailureTimeout
before.